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Abstract 

In this work we introduce a mixture of GPs to address the data associa- 
tion problem, i.e. to label a group of observations according to the sources 
that generated them. Unlike several previously proposed GP mixtures, the 
novel mixture has the distinct characteristic of using no gating function to 
determine the association of samples and mixture components. Instead, all 
the GPs in the mixture are global and samples are clustered following "tra- 
jectories" across input space. We use a non-standard variational Bayesian 
algorithm to efficiently recover sample labels and learn the hyperparameters. 
We show how multi-object tracking problems can be disambiguated and also 
explore the characteristics of the model in traditional regression settings. 

Keywords: Gaussian Processes, Marginalized Variational Inference, 
Bayesian Models 



1. Introduction 

The data association problem arises in multi-target tracking scenarios. 
Given a set of observations that represent the positions of a number of moving 
sources, such as cars or airplanes, data association consists of inferring which 
observations originate from the same source [1, 2]. Data association is found 
in tracking problems for instance in computer vision [3], surveillance, sensor 
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networks [4] and radar tracking [5]. An example of data association with two 
sources is illustrated in Figure 1. 




(a) One-dimensional observations. (b) Solution obtained by the proposed 

method. 



Figure 1: Example of a multi-target tracking scenario. Data association aims to identify 
what observations correspond to each source. 

For a human observer, little effort is required to distinguish two noisy 
trajectories in this example, representing the paths followed by two objects 
in time. In this specific case, one observation of each target is available at 
each time instant, and the measurement instants are equally spaced in time, 
although neither of these properties are required in general. 

Typical multi-target tracking algorithms operate online. They include 
joint Kalman filters [6] and joint particle filters [7]. Given the predicted po- 
sitions of the targets and a number of candidate observed positions, they 
usually make instant data association decisions based on nearest-neighbor 
criteria or statistically more sophisticated approaches such as the Joint Prob- 
abilistic Data-Association Filter (JPDAF) [5, 7] or the Multiple Hypothesis 
Tracker (MHT) [6]. An important disadvantage of these classical techniques 
is that they usually require to determine a large number of parameters. 
This drawback motivated the development of several conceptually simpler 
approaches based on motion geometry heuristics [2, 8, 9]. However, these 
approaches are usually limited to specific scenarios, and they show difficul- 
ties in the presence of noise and when several trajectories cross each other. 

Most data association techniques can be significantly improved by post- 
poning decisions until enough information is available to exclude ambiguities 
[2], although this causes the number of possible trajectories to grow expo- 
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nentially. Some attempts have been made to restrain this combinatorial 
explosion, including the heuristic methods from [10, 11]. 

In this paper we present an algorithm based on Gaussian Processes that 
is able to consider all available data points in batch form whilst avoiding 
the exponential growth in potential tracks. As a result, it is capable to deal 
with difficult data association problems in which trajectories come very close 
and even cross each other. Furthermore, the algorithm does not require any 
knowledge about the model underlying the data, and it does not need time 
instants to be evenly spaced, nor to contain observations from all sources. 

Gaussian Processes (GPs) [12] are a powerful tool for Bayesian nonlin- 
ear regression. When combined in mixture models, GPs can be applied 
to describe data where there are local non-stationarities or discontinuities 
[13, 14, 15, 16]. The components of the mixture model are GPs and the 
prior probability of any given component is typically provided by a gating 
function. The role of the gating function is to dictate which GP is a priori 
most likely to be responsible for the data in any given region of the input 
space, i.e., the gating network forces each component of the GP mixture to 
be localized. 

In this work we follow a different approach, inspired by the data associa- 
tion problem. In particular, for any given location in input space there may 
be multiple targets, perhaps corresponding to multiple objects in a tracking 
system. We are interested in constructing a GP mixture model that can 
associate each of these targets with separate components. When there is 
ambiguity, the posterior distribution of targets will reflect this. We therefore 
propose a simple mixture model in which each component is global in its 
scope. The assignment of the data to each GP is performed sample-wise, 
independently of input space localization. In other words, no gating function 
is used. We call this model the Overlapping Mixture of GPs (OMGP). 

It has been brought to our attention that the proposed model bears re- 
semblance with the work of [17]. However, the focus of application is clearly 
different. In [17], the objective is to cluster a set of trajectories accord- 
ing to their similarity, whereas in this work we tackle the task of clustering 
observations into trajectories (a more demanding task, since only single ob- 
servations, as opposed to full trajectories, are available). Also, [17] uses a 
standard variational Bayesian algorithm, whereas in this work we take ad- 
vantage of non-standard variational algorithms [18, 19] to derive a tighter 
bound. 

The remainder of this paper is organized as follows: In Section 2 we pro- 
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vide a brief review of GPs in the regression setting. Section 3 first introduces 
the OMGP model and then discusses how to perform efficient learning, hy- 
perparameter selection, and predictions using this model. Experiments on 
several data sets are provided in Section 4. We wrap up in Section 5 with a 
brief discussion. 

2. Brief Review of Gaussian Processes 

In recent years, Gaussian Processes (GPs) have attracted a lot of attention 
due to their nice analytical properties and their state-of-the-art performance 
in regression tasks (see [20]). In this section we provide a brief summary of 
the main results for GP regression, see [12] for further details. 

Assume that a set of N multi-dimensional inputs and their corresponding 
scalar outputs, V = {x n ,y n }™ 1 , are available. The regression task is, given 
a new input x*, to obtain the predictive distribution for the corresponding 
observation y* based on V. 

The GP regression model assumes that the observations can be modeled 
as some noiseless latent function of the inputs plus independent noise y = 
/(x) +e, and then sets a zero-mean 1 GP prior on the latent function /(x) ~ 
QV(0, k(x, x')) and a Gaussian prior on e ~ A/"(0, a 2 ) on the noise, where 
k(x, x') is a covariance function and a 2 is a hyperparameter that specifies 
the noise power. 

The covariance function fc(x, x') specifies the degree of coupling between 
y(x) and y(x'), and it encodes the properties of the GP such as power 
level, smoothness, etc. One of the best-known covariance functions is the 
anisotropic squared exponential. It has the form of an unnormalized Gaus- 
sian, fc(x, x') = (Tpexp (— |x T A~ 1 x) and depends on the signal power a 2 
and the length-scales A, where A is a diagonal matrix containing one length- 
scale per input dimension. Each length-scale controls how fast the correlation 
between outputs decays as the separation along the corresponding input di- 
mension grows. We will collectively refer to all kernel parameters as 0. 

The joint distribution of the available observations (collected in y) and 
some unknown output 7/(x*) is a multivariate Gaussian distribution, with 



1 To make this assumption hold, the sample mean of the set {y(x„)}™ =1 is usually 
subtracted from data before proceeding further. 
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parameters specified by the covariance function: 
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where [K] nn > = fc(x n ,x n /), [k*]„ = fc(x n ,x*) and fc** = fc(x*, x#). I at is used 
to denote the identity matrix of size N. The notation [A] nn / refers to entry 
at row n, column n' of A. Likewise, [a] n is used to reference the n-th element 
of vector a. 

From (1) and conditioning on the observed training outputs we can obtain 
the predictive distribution 
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which is computable in 0(N 3 ) time, due to the inversion 2 of the N x N 
matrix K + ct 2 Iat. 

Hyperparameters {0, a} are typically selected by maximizing the marginal 
likelihood (also called "evidence") of the observations, which is 



logp(y|0,cr) 



V (K + a 2 !^) V 



IK 



a 2 l 



N 



log(2vr). (3) 



If analytical derivatives of (3) are available, optimization can be carried 
out using gradient methods, with each gradient computation taking 0(N 3 ) 
time. GP algorithms can typically handle a few thousand data points on a 
desktop PC. 

When dealing with multi-output functions, instead of a single set of ob- 
servations y, D sets are available, y 1 . . . y D , each corresponding to a different 
output dimension. In this case we can assume independence across the out- 
puts and perform the above procedure independently for each dimension. 
This will provide reasonable results for most problems, but if correlation be- 
tween different dimensions is expected, we can take advantage of this knowl- 
edge and model them jointly using multi-task covariance functions [21]. 



2 Of course, in a practical implementation, this inversion should never be performed 
explicitly, but through the use of the Cholesky factorization and the solution of the corre- 
sponding linear systems, see [12]. 
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3. Overlapping Mixtures of Gaussian Processes (OMGP) 

The overlapping mixture of Gaussian processes (OMGP) model assumes 
that there exist M different latent functions {.p"^(x)}m=i (which we will 
call "trajectories"), and that each output is produced by evaluating one of 
these functions at the corresponding input and by adding Gaussian noise to 
it. The association between samples and latent functions is determined by 
the N x M binary indicator matrix Z: Entry [Z] nm being non-zero specifies 
that n-th data point was generated using trajectory m. Only one non-zero 
entry per row is allowed in Z. 

To model multi-dimensional trajectories (i.e., when the mixture model has 
multiple outputs), D latent functions per trajectory can be used {/j (x)}^^ 
Note that there is no need to extend Z to specifically handle the multi-output 
case, since all the outputs corresponding to a single input are the same data 
point and must belong to the same trajectory. 

For convenience we will collect all the outputs in a single matrix Y = 
[Yi • • ■ Yd] an d a U the latent functions of trajectory m in a single matrix 
F M = [ f M _ _ _ f M]_ We wiU refer to all the latent f unc tions as {F (m) }. 

Given the above description, the likelihood of the OMGP model is 

N,M,D 

p(Y|{F( m )},Z) = \{ M{[Y] nd \[Y^] ndl a 2 )^ . (4) 

n=l,m=l,d=l 

Following the standard Bayesian framework, we place priors on the un- 
observed latent variables 

N,M M,D 

P(Z)= II P(F (m) |X) = J] AT(fi m) |0,K(-)), (5) 

n=l,m=l m=l,d=l 

i.e., a multinomial distribution over the indicators (in which ^ =1 [II] nm = 
1 V n ) and independent GP priors over each latent function. 3 We allow 
different covariance matrices for each trajectory. Though the multinomial 
distribution is specified here in its more general form, additional constraints 



3 If correlation between different trajectories is known to exist, trajectories can be jointly 
modeled as a single GP, using a covariance function that accounts for this dependence. 
This would increase the computational complexity of inference for this model, but the 
following derivations can still be applied. 
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are usually imposed, such as holding the prior probabilities constant for all 
data points. For the sake of clarity, we will omit the conditioning on the 
hyperparameters {0, II, a 2 }, which can be assumed to be known for the mo- 
ment. 

Unfortunately, the analytical computation of the posterior distribution 
p(Z, {F (m) }|X, Y) is intractable, so we will resort to approximate techniques. 

3.1. Variational approximation 

If the hyperparameters are known, it is possible to approximately com- 
pute the posterior using a variational approximation. We can use Jensen's 
inequality to construct a lower bound on the marginal likelihood as follows: 

. M 

logp(Y|X) = log / p(Y|{F (m) },Z)p(Z) JJp(F (m) |X)d{F (m) }dZ (6) 

J m=l 

> f^rH)^,,^"''"'''^^ 1 '^ d{F'-» }dz = c v 

~ J Vl J ' g({F (m) },Z) 1 ' 

Here £vb is a lower bound on logp(Y|X) for any variational distribu- 
tion g({F (m) },Z) and equality is attained if and only if g({F (m) },Z) = 
p(7i, {F^}|X, Y). Our objective is therefore to find a variational distri- 
bution that maximizes £vb; and thus becomes an approximation to the true 
posterior. We will restrict our search to variational distributions that factor- 
ize aS q({F^},Z) = q({F^})q(Z). 

If we assume that <?({F (m) }) is given (and therefore, also the marginals 
q^fi™-)^ _ jV^f^l^" 1 ^ x( m )) are available), it is possible to analytically max- 
imize £vb with respect to q{Z) by setting its derivative to zero and constrain- 
ing it to be a probability density. The optimal q{Z) is then: 

N,M 

q(Z) = Yl [rifr 1 with [fl] nm cc [n] nm exp(a nm ) (7) 

n=l,m=l 

with a nm = (-2^2 ((IyJ» - [^ m) ]™) 2 + [ S(m) ]™) - \ lQ g( 2 ^ 2 )) > 
d=i ^ ° ' 

where we see that the (approximate) posterior distribution over the indicators 
q{Z) factor izes for each sample. 
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Analogously, assuming q(7i) as known, it is possible to analytically ob- 
tain the distribution over the latent functions that maximizes £vb- For the 
OMGP model, this distribution factorizes both over trajectories and dimen- 
sions, and is given by 

q {i { r ) )=w { r ) \» { r\v {m) ) w 

with S (m) = (K" 1(m) + B^)" 1 and = E^B^y^ (8b) 

where B (m ' ) is a diagonal matrix with elements [n]i m /cr 2 . . . [n]Ar m /cr 2 . 

It is now possible to initialize g(Z) and q(i^) from their prior distribu- 
tions and iterate updates (7) and (8) to obtain increasingly refined approx- 
imations to the posterior. Since both steps are optimal with respect to the 
distribution that they compute, they are guaranteed to increase £vb, and 
therefore the algorithm is guaranteed to converge to a local maximum. 

Monotonous convergence can be monitored by computing £vb after each 
update. £vb can be expressed as 

£ VB = (logp(Y| { FW},Z)) i({pWhz) 

- KL(,({fW})||p({F«»')})) - KL(,(Z)||p(Z)) 
where the first term is given by 

N,M 

(logp(Y|{FM},Z)) =£[nUcW, 

' n,rn 

and the two remaining terms are the Kullback-Leibler (KL) divergences from 
the approximate posterior to the prior, which are straightforward to compute. 

Update (7) takes only O(NM) computation time, whereas (8) takes 
0(MN 3 ) time, due to the M matrix inversions. The presented model there- 
fore has the same limitations as conventional GPs regarding the size of the 
data sets that it can be applied to. However, when the posterior probability 
of some indicator [II] nm is close to zero, sample n no longer affects trajectory 
m and can be dropped in its computation, thus reducing the cost. Further- 
more, it is possible to use sparse GPs 4 to reduce this cost 5 to 0(MN) time 
by making use of the matrix inversion lemma. 



4 Such as the standard FITC approximation, described in [22] or the variational ap- 
proach introduced in [23]. 

5 Obviously, the cost also depends on the quality of the approximation by a constant 
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3.2. An improved variational bound for OMGP 

So far we have assumed that all the hyperparameters of the model are 
known. However, in practice, some procedure to select them is needed. The 
most straightforward way of achieving this would be to select them so as to 
maximize £vb> interleaving this procedure with updates (7) and (8). How- 
ever, when the quality of this bound is sensitive to changes of the model 
hyperparameters, this approach results in very slow convergence. A solution 
to this problem is described in [18] where the advantages of maximizing an 
alternative, tighter bound on the likelihood are shown. 

The improved bound proposed in [18] is still a lower bound on the like- 
lihood but it can be proved that it is also an upper bound on the standard 
variational bound £vb- As shown in [18], if we subtract £vb from the im- 
proved bound, the result takes on the form of a KL-divergence. This fact 
can be used both to show that it upper-bounds £vb (since KL-divergences 
are always positive) and to name the new bound, which is referred to as the 
KL-corrected variational bound. 

The KL-corrected bound for the OMGP model arises when the term 
log / p(Y|{F (m) }, Z)p(Z)dZ from the true marginal likelihood (6) is replaced 

with J q(Z) log p( y I{ f ^1' z )p( z ) ^z, which according to Jensen's inequality, 
constitutes a lower bound for any distribution q(Z): 

logp(Y|X) > log I f[ p(F™ |X)gf g(Z) log P(Y|{F lTzl' ZMZ) dZ d{F( m ) } = 

m=l 

M,D 

£corrVB= £ l«g 1 0, + B" 1 ^ ) 

m=l,d=l 

n N ' M (n 2\l-[fl] nm 

-KL(5(Z)||p(Z)) + £ £ log^ • 

n=l,m=l [lljnm 

The KL-corrected lower bound £corrVB can be computed analytically and 
has the advantage with respect to £vb, of depending only on q(Z) (and not 
g({F^})), since it is possible to integrate rim=iP(F < ' m ' > |X) out analytically. 

Bound £corrVB can be alternatively obtained by following the recent work 
in [19] and optimally removing g({F < - m - ) }) from the standard bound. In the 



factor. If the FITC approximation with r pseudo-inputs (or other rank-r approximation) 
is used, the computational complexity could be expressed as 0(MNr 2 ). 
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context of that work, £corrVB is referred to as the "marginalized variational 
bound" , and it is made clear that £corrVB corresponds simply to £vb when, 
for a given q(Z), the optimal choice for q({F^}) is made. In other words, for 
the same set of hyperparameters and the same q(Z), if one choses g({F (m) }) 
according to (8), both £vb and £corrVB would provide the same result. 

Thus, learning is performed simply by optimizing £corrVB with respect to 
g(Z) and the hyperparameters, iterating the following two steps: 

• E-Step: Updates (7) and (8) are alternated, which monotonically in- 
crease both £vb and £corrVB, until convergence. Hyperparameters are 
kept fixed. 

• M-Step: Gradient descent of £corrVB with respect to all hyperparame- 
ters is performed. Distribution g(Z) is kept fixed. 

Note that it is in the M-step where £corrVB becomes actually useful, since 
this improved bound remains more stable across different hyperparameter 
selections, due to it not depending on g({F (m ' ) }), as demonstrated in [18]. 

Of course, any strategy that maximizes £corrVB is valid, but we have 
found the above EM procedure to work well in practice. 

Computing £corrVB according to the provided expression without incur- 
ring in numerical errors can be challenging in practice, since several inver- 
sions, which maybe unstable, are needed. Also, note that can take 
arbitrarily small values and thus direct inversion may not be possible. An 
implementation-friendly expression where explicit inverses are avoided is 

M 1 D N 

£co„VB = £ ( - \ £ ||R^ T \(B^yr )|| 2 ~DJ2 log[R (m) ]n.n) 
m=l d=l n=l 

N,M 

-KL(g(Z)|b(Z))-- £ [n] nm log(27nx 2 ), 

n=l, m=l 

where 

RM = c hol(I + B (m) ^K (m) B (m) 5) 
and the backslash has the usual meaning of solution to a linear system. 6 



Expressions of the type C\c refer to the solution of the linear system Cx = c and are 
a numerically stable operation requiring only 0(N 2 ) time when C is triangular, which is 
the case here. 
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3.3. Predictive distributions 

The OMGP model can be used for a variety of tasks. In the data asso- 
ciation problem (i.e., clustering data into trajectories) the task at hand is 
to cluster observations into trajectories, which can be achieved by assign- 
ing each observation to the trajectory that more likely generated it, i.e., to 
assign label m* = argmax m [II] nm to the n-th observation, so no further 
computations are necessary. For other tasks, however, it can be necessary to 
obtain predictive distributions over the output space at new locations. Under 
the variational approximation, this predictive distributions can be computed 
analytically. 

The predictive distribution in the output dimension d corresponding to a 
new test input location x* can be expressed as 

M r 

p(^|x„X,Y) = / p (y^f^x^X) v (fi m) |x,Y)dfi m) 

m=l 
M . 

« YJ^U \ V (^|fi m) ,x„X) q (f< m >|X, Y)df< m > 

m=l ^ 
M 



with 



m=l 



/i M = k T (m) (KM + B M-x r i 

4 m) = a 2 + k» - kj^ (K< m > + B^)- 1 )- 1 kl m ) , 

i.e., a Gaussian mixture under the approximate posterior. The mixing fac- 
tors [II]* m are the prior probabilities of each component, one of the given 
hyperparameters of the model, and typically constant for all inputs. 

Note the correspondence of these predictive equations with the standard 
predictions for GP regression (2). The only difference is the noise compo- 
nent, which is scaled for each sample according to [II] In particular, as the 
posterior probability of a sample belonging to the current trajectory (some- 
times known as "responsibility") decays, the amount of noise associated to 
that sample is proportionally grown, thus reducing its effect on the posterior 
process. 
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Due to the reasons mentioned in the previous subsection, the predic- 
tive equations should not be implemented directly. Instead, the following 
numerically-stable expressions should be used: 

M M = kj {m) B (m) 5(R( m )\(RM T \(B( m ^yJ m) ))), 
a 2 }f = a 2 + k» - \\R^ T \(B^k^)\\ 2 . 

3.4- Batch versus online operation 

Though the description of OMGP is oriented towards batch data associ- 
ation tasks, this model can also be successfully applied to online tasks, by 
using a data set that grows over time. New samples are included as they 
arrive and the learning process is re-started, initializing it from the state 
that was obtained as a solution for the previous problem. Depending on the 
constraints of a given problem, many different optimizations can be made to 
avoid an explosion in computational effort, such as using low-rank updates. 

Note, however, that since in this model all the elements in each latent 
function form a fully connected graph, the Markovian property does not 
hold and the computation time required for each update is not constant. 
A possible workaround to achieve constant-time updates is to use constat- 
size data sets, for instance corresponding to a sliding window, and then 
perform low-rank updates to include and remove samples. However, we will 
not pursue that option in this work. 

4. Experiments 

In this section we investigate the behavior of OMGP both in data asso- 
ciation tasks and regression tasks, showing the versatility of this model. We 
use an implementation of OMGP in Matlab on a 3GHz, dual-core desktop 
PC with 4GB of memory, yielding executions times of the order of seconds 
for each experiment. 

4-1. Data association tasks 
4-1.1. Toy data 

We first apply OMGP to perform data association on a toy data set. The 
sources perform circular motions, one clockwise and one counterclockwise, 
as depicted in Fig. 2(a). The available observations represent the measured 
positions of the sources (which include Gaussian noise) at known time in- 
stants. However, it is not known which observed position corresponds to 
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which source. Since both trajectories are circles with the same center and 
radius, the sources cross each other twice per revolution, making the cluster- 
ing problem more difficult. However, as shown in Fig. 2(b), OMGP is capable 
of successfully identifying the unknown trajectories. Fig. 2(c) illustrates the 
uncertainty about the estimated labels. Specifically, it shows a decrease in 
the posterior probability of the correct labels whenever the two sources come 
close. 




20 40 60 80 100 120 140 160 180 200 

(c) Posterior probability of correct labels. 



Figure 2: (a) Observations for two sources that move in opposite circles, (b) The data 
association solution obtained by OMGP. (c) Posterior probability of the correct label for 
observations coming from source 1 (top) and 2 (bottom). 

4-1.2. Missile-to-air multi-target tracking 

Next, we consider a missile-to-air tracking scenario as described in [7]. 
The motion dynamics of this scenario are defined by the following state- 
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space equations: 



Is Tl 3 
3 I 3 




h(st) 




+ e t . 



. arctan( 7t^ ) . 



In this model, the state vector s t = [X t , Y t , Z t , V x> t, V y> t, V Z) t] contains the 
source position and velocity components, r t contains the observed measure- 
ments, T is the sampling interval, and I3 and O3 represent the 3x3 unity 
matrix and null matrix, respectively. The process noise w t and measurement 
noise e t are assumed Gaussian, v t G A/"(0, Q) and e t G A/"(0,R). For more 
details refer to [7]. The problem posed in [7] consists in tracking two sources 
and estimating their unknown state vector, given their correct initial states 
si = [6500,-1000,2000,-50,100,0] and s 2 = [5050,-450,2000,100,50,0]. 
We consider a more complex scenario by adding a third source, with initial 
state si = [8000, 500, 2000, -100, 0, 0], which passes close to one of the other 
sources at a certain instant. 

We apply the SIR/MCJPDA filter from [7] and OMGP to perform data 
association on the observations. The SIR/MCJPDA filter consists of a set of 
joint particle filters that perform tracking of multiple sources, combined with 
a joint probability data association (JPDA) technique which provides instan- 
taneous data association. The number of particles used in this experiment is 
25000. In order to operate correctly, the SIR/MCJPDA filter requires com- 
plete knowledge of the used state-space model and the initial state vectors Xq. 
Note that OMGP is completely blind in this regard. The OMGP algorithm 
is operated first in its incremental online setting. For illustration purposes, 
we also include results of the batch version of the OMGP algorithm. 

The trajectories obtained by each method can be found in Fig. 3, along 
with the predicted measurements. Although the SIR/MCJPDA filter initially 
performs correctly, it encounters difficulties at the point where the sources 
come close. After this point it shows erroneous assignments for at least one 
trajectory. Its mistakes are mainly due to its state vector depending only 
on 1 previous state, which proves insufficient if the sources are close during 
multiple consecutive measurements. The online version of OMGP does not 
show this problem. The smoothest solution is obtained by batch OMGP, 
which performs a global evaluation of the entire trajectories. 

To evaluate the performance of the algorithms, we measure the RMSE 
of each trajectory. These values can be found in Table 1, along with the 
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5000 5500 6000 6500 7000 7500 8000 8500 9000 5000 5500 6000 6500 7000 7500 8000 8500 9000 



(a) Trajectories identified by (b) Trajectories identified by OMGP, in- 
SIR/MCJPDA. cremental online version. 




5000 5500 6000 6500 7000 7500 8000 8500 9000 



(c) Trajectories identified by OMGP, 
batch solution. 



Figure 3: Missile-to-air data association problem with three sources. The starting point 
of each source is marked with a black dot. 



number of observations that are assigned to the wrong trajectories, n err , out 
of a total of 90 observations. As can be observed, both versions of the OMGP 
algorithm obtain superior results compared to SIR/MCJPA. Furthermore, 
while SIR/MCJPDA requires complete knowledge of the state-space model 
and the initial state vectors, OMGP does not require any knowledge of the 
underlying model. 

4-1.3. Interference alignment in OFDM wireless networks 

Interestingly, the data association problem can be found in contexts that 
go beyond standard multi-target tracking scenarios, such as digital commu- 
nications [24]. In the third experiment we apply OMGP to a data association 
problem that occurs in wireless communication networks. 
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Tabic 1: RMSE comparison on the missile-to-air data association problem. 



Algorithm 


RMSE #1 


RMSE #2 


RMSE #3 


Tl err 


SIR/MCJPDA 


292.46 


150.07 


258.14 


17 


OMGP (online) 


182.31 


151.46 


163.92 


6 


OMGP (batch) 


133.30 


80.23 


118.94 


1 



Interference alignment (IA) is a concept that has recently emerged as 
a solution to raise the capacity of wireless multiple-input multiple-output 
(MIMO) networks [25]. The underlying idea of IA along the spatial dimen- 
sions is that the interference from other transmitters must be aligned at each 
receiver in a subspace orthogonal to the signal space. In order to implement 
interference alignment in scenarios with multiple subcarriers, a digital filter 
must be applied at each transmit antenna. Here we will consider a 3-user 
interference channel with two antennas per node and OFDM modulation 
using N c subcarriers [26], which allows for two possible filter responses per 
subcarrier. Since only smooth frequency responses can be implemented, the 
smoothest solution of the 2 Nc possible choices should be selected. 

This combinatorial problem corresponds to a data association problem 
in which only the smoothest curve is of interest, (see Fig. 4(a)). The data 
used for this experiment consists of two simulated data sets and one data set 
obtained with a MIMO test bed setup 7 , each using 52 subcarriers. In Fig. 4 
we illustrate the solutions obtained by OMGP on these data sets. While 
the simulated data sets from Fig. 4(b) and Fig. 4(c) represent reasonably 
simple data association problems, the performance of OMGP on the real- 
world data set of Fig. 4(d) shows that it is capable of correctly distinguishing 
the smoothly-varying solution from the surrounding noisy data. As a matter 
of fact, we have been able to successfully implement OMGP in the IA setting 
for a parallel ungoing research project. 

4-2. Regression tasks 

We now consider application of the model in more standard regression 
tasks. In particular, we consider tasks where the target density is multimodal, 
which is the case when the data comes from multiple sources. 



7 See [27] for a full description of the used test bed. 
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Figure 4: Data association results obtained by OMGP on different interference alignment 
problems, (a) shows the IA solutions (imaginary part only) for the first simulated data 
set. (b) and (c) show the solutions for the first and second simulated data sets (real and 
imaginary part versus sub-carrier number), (d) shows the I A solution for a real- world data 
set. Note that complex values are simply simply treated as two-dimensional real data in 
this experiment. 



4-2.1. Multilevel regression 

Consider the data set from Fig. 5(a), which corresponds to observations 
from three independent functions. A normal GP would fail to produce valid 
multimodal outputs and previously proposed mixtures of GPs would restrict 
the component GPs to local parts of the space. OMGP can properly label 
each observation according to the generating function and provide multi- 
modal predictive distributions, as depicted in Fig. 5(b). 

Fig. 5 can also be interpreted as measurements of the position of three 
particles moving along one dimension, of which snapshots are taken at irreg- 
ular time intervals (horizontal axis). Each snapshot introduces noise in the 
position measurement and does not necessarily capture the position of all the 
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(a) Original data set. (b) Inferred labels and predictive log- 

probs. 



Figure 5: Posterior log-probability of the OMGP model and label inference. 

particles. In this case OMGP could be used to predict the position of any 
particle at any given point in time, as well as to properly label the samples 
in each snapshot. 

4-2.2. Robust regression 

Since each GP in the mixture can use a different covariance function, 
it is possible to use a GP to capture unrelated outliers and another one to 
interpolate the main function. This is easily achieved by a mixture of two 
GPs, one with the ARD-SE covariance function and another with k(x,x') = 
b 2 5(x,x'), i.e., white noise. We consider the problem of regression in a noisy 
sine in which some outliers have been introduced in Fig. 6 (top row). Observe 
how OMGP both identifies the outliers and ignores them, resulting in much 
better predictive means and variances. 

4-2.3. Heteroscedastic behavior 

Finally, Fig. 6 (bottom row) shows the results of running a GP and OMGP 
on the motorcycle data set from [28]. Two components have been identified, 
which might or might not correspond to two actual physical mechanisms al- 
ternatively producing observations. The predictive variances show improved 
behavior with respect to the standard GP. 
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(a) Noisy sine. Standard GP 



(b) Noisy sine. OMGP 




(c) Motorcycle. Standard GP 



(d) Motorcycle. OMGP 



Figure 6: Predictive means and variances for two different data sets. The shaded area 
denotes ±2 standard deviations around the mean. Top row: Noisy sine with outliers, (a) 
Standard GP and (b) OMGP with a noise-only component. (Only the predictive mean 
and variance of the signal component is depicted, which includes noise a 2 ). Bottom row: 
Silverman's motorcycle data set. 



5. Discussion and future work 

In this work we have introduced a novel GP mixture model inspired by 
multi-target tracking problems. The new model has the important difference 
with respect to previous approaches of using global mixture components and 
assigning samples to components by relying on their value in output space, 
instead of input space (as it is done when gating functions are used). 

A simple and efficient algorithm for inference relying on the variational 
Bayesian framework has been provided. The model can be applied in prac- 
tice due to the use of an improved, KL-corrected variational bound to learn 
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the hyperparameters. Direct optimization of this bound both to obtain an 
approximate posterior and to learn the hyperparameters will be considered 
in a further work. 

The OMGP model offers promising results when tracking moving targets, 
as has been illustrated experimentally in Section 4 and compares favorably 
with established methods in the field. Also, through imaginative application 
of the model using different covariance functions we were able to adapt the 
approach to robust regression and heteroscedastic noise. 

Naive implementation of GPs limits their applicability to only a few thou- 
sand data samples. However, recent advances in sparse approximations (e.g. 
[22, 23]) greatly should enable our approach to be applied to much larger 
data sets. 
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